 remote sensing


WKV-sharing embraced random shuffle RWKV high-order modeling for pan-sharpening

Neural Information Processing Systems

Pan-sharpening aims to generate a spatially and spectrally enriched multi-spectral image by integrating information from a low-resolution multi-spectral image and its texture-rich panchromatic counterpart. In this work, we propose a WKV-sharing embraced random shuffle RWKV high-order modeling paradigm for pan-sharpening from a Bayesian perspective, coupled with a random weight manifold distribution training strategy derived from Functional theory to regularize the solution space, adhering to the following principles: 1) Random-shuffle RWKV. Recently, the Vision RWKV model, with its inherent linear complexity in global modeling, has inspired us to explore its untapped potential in pan-sharpening tasks. However, its attention mechanism, relying on a recurrent bidirectional scanning strategy, suffers from biased effects and demands significant processing time. To address this, we propose a novel Bayesian-inspired scanning strategy called Random Shuffle, complemented by a theoretically sound inverse shuffle to preserve information coordination invariance, effectively eliminating biases associated with fixed sequence scanning.
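The shuffle and inverse-shuffle pairing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_shuffle` and `inverse_shuffle` are hypothetical, tokens are plain Python list items rather than image features, and the RWKV processing between the two steps is omitted. It only shows that a random permutation of the scan order can be undone exactly, so each token returns to its original position.

```python
import random


def random_shuffle(tokens, rng):
    """Return the tokens in a random order, plus the permutation used.

    perm[pos] is the original index of the token now at position pos.
    """
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    return [tokens[i] for i in perm], perm


def inverse_shuffle(shuffled, perm):
    """Undo random_shuffle: place each token back at its original index."""
    restored = [None] * len(perm)
    for pos, src in enumerate(perm):
        restored[src] = shuffled[pos]
    return restored


# Hypothetical usage: 8 placeholder tokens standing in for image patches.
rng = random.Random(0)
tokens = [f"t{i}" for i in range(8)]
shuffled, perm = random_shuffle(tokens, rng)
# A sequence model would process `shuffled` here (omitted in this sketch).
restored = inverse_shuffle(shuffled, perm)
```

Because the permutation is stored alongside the shuffled sequence, the spatial layout is recoverable regardless of which random order was drawn.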



Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables

Neural Information Processing Systems

Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, deep learning-based methods incur substantial computational overhead during inference, especially with large images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for large remote sensing images. Our method makes it possible to process 15K × 15K remote sensing images on a 24GB GPU. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details, we introduce the spatial details look-up table (SDLUT).
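To give a sense of why table look-ups are cheap at inference time, here is a generic sketch of a channel-wise 1D look-up with linear interpolation. It is an assumption-laden illustration, not PGLUT or SDLUT: the tables are fixed rather than learned, there is no PAN guidance, and the names `apply_lut_1d` and `apply_channel_luts` are hypothetical.

```python
def apply_lut_1d(value, lut):
    """Map a value in [0, 1] through a 1D table with linear interpolation."""
    value = min(max(value, 0.0), 1.0)   # clamp to the table's input range
    pos = value * (len(lut) - 1)        # fractional index into the table
    lo = int(pos)
    hi = min(lo + 1, len(lut) - 1)
    frac = pos - lo
    return lut[lo] * (1.0 - frac) + lut[hi] * frac


def apply_channel_luts(pixel, luts):
    """Apply one table per spectral channel of a single pixel."""
    return [apply_lut_1d(v, lut) for v, lut in zip(pixel, luts)]


# Hypothetical tables: an identity mapping and an inverted mapping, 5 entries each.
identity = [i / 4 for i in range(5)]
inverted = [1.0 - v for v in identity]

pixel = [0.25, 0.5]  # two-channel pixel with values in [0, 1]
out = apply_channel_luts(pixel, [identity, inverted])
```

Each output value costs one index computation and one interpolation, independent of any network depth, which is the general property that makes LUT-based methods attractive for very large images.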


Rethinking Evaluation of Infrared Small Target Detection

Neural Information Processing Systems

As an essential vision task, infrared small target detection (IRSTD) has seen significant advancements through deep learning. However, critical limitations in current evaluation protocols impede further progress. First, existing methods rely on fragmented pixel- and target-level specific metrics, which fail to provide a comprehensive view of model capabilities. Second, an excessive emphasis on overall performance scores obscures crucial error analysis, which is vital for identifying failure modes and improving real-world system performance. Third, the field predominantly adopts dataset-specific training-testing paradigms, hindering the understanding of model robustness and generalization across diverse infrared scenarios.


RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Neural Information Processing Systems

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,351 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing.


REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Neural Information Processing Systems

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 25%. Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.



IRRISIGHT: A Large-Scale Multimodal Dataset and Scalable Pipeline to Address Irrigation and Water Management in Agriculture

Neural Information Processing Systems

The lack of fine-grained, large-scale datasets on water availability presents a critical barrier to applying machine learning (ML) for agricultural water management. Since there are multiple natural and anthropogenic factors that influence water availability, incorporating diverse multimodal features can significantly improve modeling performance. However, integrating such heterogeneous data is challenging due to spatial misalignments, inconsistent formats, semantic label ambiguities, and class imbalances. To address these challenges, we introduce IRRISIGHT, a large-scale, multimodal dataset spanning 20 U.S. states. It consists of 1.4 million pixel-aligned 224 × 224 patches that fuse satellite imagery with rich environmental attributes. We develop a robust geospatial fusion pipeline that aligns raster, vector, and point-based data on a unified 10 m grid, and employ domain-informed structured prompts to convert tabular attributes into natural language. With irrigation type classification as a representative problem, the dataset is AI-ready, offering a spatially disjoint train/test split and extensive benchmarking with both vision and vision-language models. Our results demonstrate that multimodal representations substantially improve model performance, establishing a foundation for future research on water availability.
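The idea of converting tabular attributes into natural language through structured prompts can be sketched as a simple template fill. Everything specific here is an assumption made for illustration: the attribute names (`crop`, `soil_texture`, `annual_precip_mm`), the sentence templates, and the function `attributes_to_text` are hypothetical and are not taken from the IRRISIGHT pipeline.

```python
# Hypothetical templates: (attribute key, sentence pattern) in a fixed order.
TEMPLATES = [
    ("crop", "The dominant crop is {}."),
    ("soil_texture", "The soil texture is {}."),
    ("annual_precip_mm", "Annual precipitation is about {} mm."),
]


def attributes_to_text(row):
    """Render the available attributes of one patch as short sentences.

    Missing attributes are skipped rather than filled with placeholders.
    """
    parts = []
    for key, template in TEMPLATES:
        value = row.get(key)
        if value is not None:
            parts.append(template.format(value))
    return " ".join(parts)


# Hypothetical record for one image patch.
row = {"crop": "corn", "soil_texture": "silt loam", "annual_precip_mm": 820}
text = attributes_to_text(row)
```

A fixed template order keeps the generated descriptions consistent across patches, which matters when the text is later paired with imagery for vision-language training.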


RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

Neural Information Processing Systems

Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, change detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models were released at RoMA.